The goal of this experiment is to check whether clustering can be used as a feature extraction method for classification. The basic premise is to cluster the dataset into k clusters and then either use each cluster as a new feature or add a single feature reflecting the cluster assignment. The values of these features can be binary (whether an example is assigned to a cluster), continuous (degree of cluster membership), or discrete (the cluster number, in the single-feature case). We would like to investigate whether such cluster-based features improve classification quality.
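As a concrete illustration, the three encodings can be sketched in a few lines of base R (iris and k-means serve purely as stand-ins here; the actual experiment settings follow below):

```r
# Illustration only: the three cluster-based feature encodings.
set.seed(23)
X <- scale(iris[, 1:4])
km <- kmeans(X, centers = 8, nstart = 10)

# discrete: a single factor feature holding the cluster number
f.factor <- factor(km$cluster)

# binary: one 0/1 indicator feature per cluster
f.binary <- model.matrix(~ f.factor - 1)

# continuous: distance from each example to each cluster centre
f.dist <- t(apply(X, 1, function(p) sqrt(colSums((t(km$centers) - p)^2))))

dim(f.binary)  # 150 x 8
dim(f.dist)    # 150 x 8
```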


  1. Experiment settings
supportedDatasets = c("wine", "breast-cancer-wisconsin", "yeast", "glass", "ecoli",
             "vowel-context", "iris", "pima-indians-diabetes", "sonar.all",
             "image-segmentation", "ionosphere", "letter", "magic", "optdigits",
             "pendigits", "spectrometer", "statlog-satimage", "statlog-segment", "statlog-vehicle")

dataset.name = "pima-indians-diabetes"
classifier = "svmLinear"

feature.types = c("factor", "binary", "distance", "binaryFS", "binaryDist", "revDistSquared", "distFS")
feature.type = feature.types[3]
measures = c("euclidean", "mahalanobis")
measure = measures[1]
clusteringPerClass = FALSE
newFeaturesOnly = FALSE
number_of_clusters = 8
scaling = FALSE

train_testSplitRatio = 0.5
folds = 5
repeats = 2

set.seed(23)
  2. Read data

For now, let’s use the pima-indians-diabetes dataset.

##        V1                V2                V3                V4         
##  Min.   :-1.1411   Min.   :-3.7812   Min.   :-3.5703   Min.   :-1.2874  
##  1st Qu.:-0.8443   1st Qu.:-0.6848   1st Qu.:-0.3671   1st Qu.:-1.2874  
##  Median :-0.2508   Median :-0.1218   Median : 0.1495   Median : 0.1544  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.6395   3rd Qu.: 0.6054   3rd Qu.: 0.5629   3rd Qu.: 0.7186  
##  Max.   : 3.9040   Max.   : 2.4429   Max.   : 2.7327   Max.   : 4.9187  
##        V5                V6                  V7                V8         
##  Min.   :-0.6924   Min.   :-4.057829   Min.   :-1.1888   Min.   :-1.0409  
##  1st Qu.:-0.6924   1st Qu.:-0.595191   1st Qu.:-0.6885   1st Qu.:-0.7858  
##  Median :-0.4278   Median : 0.000941   Median :-0.2999   Median :-0.3606  
##  Mean   : 0.0000   Mean   : 0.000000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.4117   3rd Qu.: 0.584390   3rd Qu.: 0.4659   3rd Qu.: 0.6598  
##  Max.   : 6.6485   Max.   : 4.452906   Max.   : 5.8797   Max.   : 4.0611  
##  Class  
##  0:500  
##  1:268  
##         
##         
##         
## 

The number of classes in this dataset is 2.

  3. Select the number of clusters

Since our goal is to use the clusters as attributes for classification, it makes sense to use more clusters than there are classes in the dataset.
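The number reported below comes from affinity propagation; as a rough, purely illustrative base-R sanity check of this heuristic, one can also scan k and watch the within-cluster sum of squares:

```r
# Elbow-style scan (illustration only, not the AP-based selection used here).
set.seed(23)
X <- scale(iris[, 1:4])   # toy stand-in with 3 classes
wss <- sapply(2:10, function(k) kmeans(X, centers = k, nstart = 10)$tot.withinss)
round(wss, 1)             # decreases with k; flattens past the useful range
```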

The number of clusters found in pima-indians-diabetes dataset is 7.

  4. Clustering

K-means clustering

Clustering dataset into 8 clusters.

Affinity propagation clustering

Spectral clustering
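The clustering calls themselves are not echoed in this report; assuming the usual R implementations (`stats::kmeans`, the `apcluster` package, and `kernlab::specc`), the three runs might look like the sketch below (the two package calls are assumptions and shown commented out):

```r
set.seed(23)
X  <- scale(iris[, 1:4])                   # stand-in data
km <- kmeans(X, centers = 8, nstart = 10)  # k-means (base R)
# ap <- apcluster::apcluster(apcluster::negDistMat(r = 2), X)  # affinity propagation
# sc <- kernlab::specc(as.matrix(X), centers = 8)              # spectral clustering
table(km$cluster)  # cluster sizes
```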

  5. Classification

Without new features

  • Training
## Support Vector Machines with Linear Kernel 
## 
## 384 samples
##   8 predictor
##   2 classes: '0', '1' 
## 
## No pre-processing
## Resampling: Cross-Validated (5 fold, repeated 2 times) 
## Summary of sample sizes: 307, 308, 307, 307, 307, 307, ... 
## Resampling results:
## 
##   Accuracy   Kappa    
##   0.7552119  0.4215967
## 
## Tuning parameter 'C' was held constant at a value of 1
  • Testing
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   0   1
##          0 217  53
##          1  33  81
##                                          
##                Accuracy : 0.776          
##                  95% CI : (0.731, 0.8168)
##     No Information Rate : 0.651          
##     P-Value [Acc > NIR] : 7.049e-08      
##                                          
##                   Kappa : 0.4894         
##  Mcnemar's Test P-Value : 0.04048        
##                                          
##             Sensitivity : 0.8680         
##             Specificity : 0.6045         
##          Pos Pred Value : 0.8037         
##          Neg Pred Value : 0.7105         
##              Prevalence : 0.6510         
##          Detection Rate : 0.5651         
##    Detection Prevalence : 0.7031         
##       Balanced Accuracy : 0.7362         
##                                          
##        'Positive' Class : 0              
## 

With new k-means features

  • Training
## Support Vector Machines with Linear Kernel 
## 
## 384 samples
##  16 predictor
##   2 classes: '0', '1' 
## 
## No pre-processing
## Resampling: Cross-Validated (5 fold, repeated 2 times) 
## Summary of sample sizes: 307, 308, 307, 307, 307, 307, ... 
## Resampling results:
## 
##   Accuracy   Kappa    
##   0.7500342  0.4227964
## 
## Tuning parameter 'C' was held constant at a value of 1
  • Testing
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   0   1
##          0 211  52
##          1  39  82
##                                           
##                Accuracy : 0.763           
##                  95% CI : (0.7173, 0.8047)
##     No Information Rate : 0.651           
##     P-Value [Acc > NIR] : 1.385e-06       
##                                           
##                   Kappa : 0.4664          
##  Mcnemar's Test P-Value : 0.2084          
##                                           
##             Sensitivity : 0.8440          
##             Specificity : 0.6119          
##          Pos Pred Value : 0.8023          
##          Neg Pred Value : 0.6777          
##              Prevalence : 0.6510          
##          Detection Rate : 0.5495          
##    Detection Prevalence : 0.6849          
##       Balanced Accuracy : 0.7280          
##                                           
##        'Positive' Class : 0               
## 
  • Variable importance

With new affinity propagation features

  • Training
## Support Vector Machines with Linear Kernel 
## 
## 384 samples
##  42 predictor
##   2 classes: '0', '1' 
## 
## No pre-processing
## Resampling: Cross-Validated (5 fold, repeated 2 times) 
## Summary of sample sizes: 308, 307, 307, 307, 307, 307, ... 
## Resampling results:
## 
##   Accuracy   Kappa    
##   0.7343985  0.3870412
## 
## Tuning parameter 'C' was held constant at a value of 1
  • Testing
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   0   1
##          0 214  53
##          1  36  81
##                                           
##                Accuracy : 0.7682          
##                  95% CI : (0.7227, 0.8095)
##     No Information Rate : 0.651           
##     P-Value [Acc > NIR] : 4.391e-07       
##                                           
##                   Kappa : 0.4744          
##  Mcnemar's Test P-Value : 0.08989         
##                                           
##             Sensitivity : 0.8560          
##             Specificity : 0.6045          
##          Pos Pred Value : 0.8015          
##          Neg Pred Value : 0.6923          
##              Prevalence : 0.6510          
##          Detection Rate : 0.5573          
##    Detection Prevalence : 0.6953          
##       Balanced Accuracy : 0.7302          
##                                           
##        'Positive' Class : 0               
## 
  • Variable importance

With new spectral clustering features

  • Training
## Support Vector Machines with Linear Kernel 
## 
## 384 samples
##  16 predictor
##   2 classes: '0', '1' 
## 
## No pre-processing
## Resampling: Cross-Validated (5 fold, repeated 2 times) 
## Summary of sample sizes: 307, 307, 308, 307, 307, 307, ... 
## Resampling results:
## 
##   Accuracy   Kappa    
##   0.7474539  0.4222641
## 
## Tuning parameter 'C' was held constant at a value of 1
  • Testing
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   0   1
##          0 210  51
##          1  40  83
##                                           
##                Accuracy : 0.763           
##                  95% CI : (0.7173, 0.8047)
##     No Information Rate : 0.651           
##     P-Value [Acc > NIR] : 1.385e-06       
##                                           
##                   Kappa : 0.4683          
##  Mcnemar's Test P-Value : 0.2945          
##                                           
##             Sensitivity : 0.8400          
##             Specificity : 0.6194          
##          Pos Pred Value : 0.8046          
##          Neg Pred Value : 0.6748          
##              Prevalence : 0.6510          
##          Detection Rate : 0.5469          
##    Detection Prevalence : 0.6797          
##       Balanced Accuracy : 0.7297          
##                                           
##        'Positive' Class : 0               
## 
  • Variable importance

  6. Experiments

6.1 Comparative evaluation

Let us now perform an experiment with different classifiers over multiple datasets. For each setting we will test classification without added features and with features generated using affinity propagation and k-means. The experimental methodology is organized as follows. Each dataset is scaled and split into training and testing sets with a split ratio of 0.5. Next, new features are added to the datasets using both clustering algorithms. Afterwards, for each classifier, three models are trained on the original, affinity propagation-enriched, and k-means-enriched training sets. Training is performed using 5-fold cross-validation repeated 2 times. Finally, the trained models are tested on the corresponding test sets and evaluated using accuracy and Cohen’s kappa.
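One detail of the enrichment step worth spelling out: clusters are fit on the training half only, and the test-set features are distances to those training-derived centers. A base-R sketch (iris and k-means standing in for the actual datasets and algorithms):

```r
set.seed(23)
X   <- scale(iris[, 1:4])
idx <- sample(nrow(X), nrow(X) %/% 2)              # 0.5 train/test split
km  <- kmeans(X[idx, ], centers = 8, nstart = 10)  # fit on training half only

dist.to.centers <- function(M, centers)
  t(apply(M, 1, function(p) sqrt(colSums((t(centers) - p)^2))))

train.enriched <- cbind(X[idx, ],  dist.to.centers(X[idx, ],  km$centers))
test.enriched  <- cbind(X[-idx, ], dist.to.centers(X[-idx, ], km$centers))
ncol(train.enriched)  # 4 original + 8 new features = 12
```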

Summary of affinity propagation results:

## # A tibble: 9 x 4
##   Classifier Better Worse Equal
##   <fct>       <int> <int> <int>
## 1 pda            14     4     1
## 2 svmLinear      14     4     1
## 3 bayesglm       13     2     4
## 4 multinom       13     3     3
## 5 rpart          12     3     4
## 6 PART            9     8     2
## 7 gbm             6     8     3
## 8 knn             4    14     0
## 9 svmRadial       3    15     1

Summary of k-means results:

## # A tibble: 9 x 4
##   Classifier Better Worse Equal
##   <fct>       <int> <int> <int>
## 1 bayesglm       13     2     4
## 2 rpart          13     2     4
## 3 svmLinear      12     5     2
## 4 pda            11     7     1
## 5 multinom       10     7     2
## 6 PART            7    10     2
## 7 knn             6    11     1
## 8 gbm             5     7     5
## 9 svmRadial       4    12     3

To check whether there are any significant differences between the proposed approaches, we performed the Friedman statistical test with a post-hoc Nemenyi test, the results of which are presented below for linear SVM.

##       no       km       ap 
## 1.552632 1.947368 2.500000
## 
##  Friedman rank sum test
## 
## data:  friedmanData
## Friedman chi-squared = 9.6176, df = 2, p-value = 0.008157
## 
##  Pairwise comparisons using Nemenyi multiple comparison test 
##              with q approximation for unreplicated blocked data 
## 
## data:  friedmanData 
## 
##    no     km    
## km 0.4433 -     
## ap 0.0098 0.2039
## 
## P value adjustment method: none
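The test itself is available in base R as `friedman.test`; the post-hoc output above presumably comes from the PMCMR package. A sketch on dummy data (rows are datasets, i.e. blocks; columns are the compared approaches):

```r
set.seed(23)
friedmanData <- matrix(runif(19 * 3), ncol = 3,
                       dimnames = list(NULL, c("no", "km", "ap")))
ft <- friedman.test(friedmanData)
ft$p.value
# PMCMR::posthoc.friedman.nemenyi.test(friedmanData)  # assumed pairwise post-hoc
```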

6.2 Cluster representation test

After clustering the training examples, there are many ways to turn this information into new features. In this experiment, we will compare several methods to determine which one works best. The two main options are encoding each cluster as a binary or a numerical feature. In the binary case, the values indicate whether an example belongs to a given cluster (1) or not (0). In the numerical case, the values indicate the distance from each example to a given cluster representative. There are also many possible variations of these two variants. All in all, we will consider the following options, with the resulting per-dataset accuracies listed below:
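The text does not pin down the exact formulas behind the `Bin.Dist` and `RevDist2` variants, so the sketch below is one plausible reading (indicator times distance, and a reversed squared distance, respectively); `Bin` and `Dist` follow the descriptions above:

```r
# Hedged sketch: the four encodings for one example x and a matrix of cluster
# centers C (one row per cluster); names mirror the table columns below.
encode <- function(x, C) {
  d <- sqrt(colSums((t(C) - x)^2))                   # Dist
  b <- as.numeric(seq_len(nrow(C)) == which.min(d))  # Bin (nearest cluster)
  list(Bin      = b,
       Dist     = d,
       Bin.Dist = b * d,          # assumption: distance kept only for own cluster
       RevDist2 = 1 / (1 + d^2))  # assumption: one way to "reverse" the distance
}

e <- encode(c(0, 0), C = rbind(c(1, 0), c(3, 4)))
e$Dist  # 1 5
```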

##                    Dataset       Bin      Dist  Bin.Dist  RevDist2
## 1                     wine 0.9318182 1.0000000 0.9318182 0.9431818
## 2  breast-cancer-wisconsin 0.9501466 0.9648094 0.9589443 0.9384164
## 3                    yeast 0.6035183 0.6143437 0.5588633 0.5520974
## 4                    glass 0.6476190 0.6761905 0.5809524 0.6000000
## 5                    ecoli 0.8493976 0.8493976 0.8433735 0.8253012
## 6            vowel-context 0.4969697 0.8666667 0.4868687 0.4222222
## 7                     iris 0.8533333 0.9466667 0.8533333 0.7333333
## 8    pima-indians-diabetes 0.7005208 0.7734375 0.7031250 0.7265625
## 9                sonar.all 0.7087379 0.8155340 0.6990291 0.4951456
## 10      image-segmentation 0.8597403 0.9255411 0.8562771 0.5627706
## 11              ionosphere 0.8457143 0.9257143 0.8171429 0.9314286
## 12                  letter 0.7342139 0.9224457 0.7262083 0.9258481
## 13                   magic 0.8005258 0.8608833 0.8006309 0.8422713
## 14               optdigits 0.9334283 0.9704521 0.9291563 0.9658241
## 15               pendigits 0.9685054 0.9899873 0.9675951 0.9797925
## 16            spectrometer 0.3694779 0.5381526 0.3493976 0.3534137
## 17        statlog-satimage 0.8532338 0.8905473 0.8507463 0.8886816
## 18         statlog-segment 0.8649351 0.9203463 0.8510823 0.8805195
## 19         statlog-vehicle 0.5592417 0.7085308 0.5592417 0.6398104

The results clearly indicate that the distance-based approach is superior to the other variants. To further verify this observation, we performed the Friedman and post-hoc Nemenyi tests, which confirm that the distance-based approach is indeed significantly better than the alternatives.

## Bin.Dist      Bin RevDist2     Dist 
## 1.605263 2.263158 2.263158 3.868421
## 
##  Friedman rank sum test
## 
## data:  friedmanData
## Friedman chi-squared = 32.435, df = 3, p-value = 4.236e-07
## 
##  Pairwise comparisons using Nemenyi multiple comparison test 
##              with q approximation for unreplicated blocked data 
## 
## data:  friedmanData 
## 
##          Bin.Dist Bin     RevDist2
## Bin      0.39545  -       -       
## RevDist2 0.39545  1.00000 -       
## Dist     3.9e-07  0.00073 0.00073 
## 
## P value adjustment method: none

6.3 Comparison of clustering algorithms

##                    Dataset        AP        KM        SC
## 1                     wine 1.0000000 1.0000000 0.9886364
## 2  breast-cancer-wisconsin 0.9648094 0.9706745 0.9736070
## 3                    yeast 0.6143437 0.5953992 0.6089310
## 4                    glass 0.6761905 0.6761905 0.6952381
## 5                    ecoli 0.8493976 0.8433735 0.8433735
## 6            vowel-context 0.8666667 0.8545455 0.8262626
## 7                     iris 0.9466667 0.9466667 0.8933333
## 8    pima-indians-diabetes 0.7734375 0.7812500 0.7630208
## 9                sonar.all 0.8155340 0.7281553 0.7378641
## 10      image-segmentation 0.9255411 0.9307359 0.9341991
## 11              ionosphere 0.9257143 0.9485714 0.9314286
## 12                  letter 0.9224457 0.9209447 0.9232463
## 13                   magic 0.8608833 0.8629863 0.8664564
## 14               optdigits 0.9704521 0.9747241 0.9729441
## 15               pendigits 0.9899873 0.9901693 0.9885309
## 16            spectrometer 0.5381526 0.5702811 0.5542169
## 17        statlog-satimage 0.8905473 0.8893035 0.8840174
## 18         statlog-segment 0.9203463 0.9316017 0.9419913
## 19         statlog-vehicle 0.7085308 0.7037915 0.7109005
##       AP       SC       KM 
## 1.921053 2.026316 2.052632
## 
##  Friedman rank sum test
## 
## data:  friedmanData
## Friedman chi-squared = 0.19444, df = 2, p-value = 0.9074
## 
##  Pairwise comparisons using Nemenyi multiple comparison test 
##              with q approximation for unreplicated blocked data 
## 
## data:  friedmanData 
## 
##    AP   SC  
## SC 0.94 -   
## KM 0.91 1.00
## 
## P value adjustment method: none

6.4 Global vs local clustering

The intuition behind our approach is that clustering the training examples regardless of their class could help generalization through the use of global information. We refer to this approach as global. However, one could argue for an alternative in which clustering is performed separately within each class. We call this approach local. This way, we still add some global information about the similarity of distant objects, with the additional potential benefit of modeling the space occupied by each class. To verify which approach is better, we compared the two empirically.

To make the comparison meaningful, we have to ensure an equal number of clusters in both approaches, so that the results depend solely on the generated clusters and not their quantity. To achieve this, the experiment was performed as follows. First, we performed clustering separately in each class, using affinity propagation to automatically determine the number of clusters. Next, we repeated the experiment with the global approach, using k-means with the number of clusters equal to the total number of clusters across all classes. The results of this experiment are presented below.
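In base R the two schemes can be contrasted as follows (k-means stands in for affinity propagation in the local step, and iris for the actual datasets):

```r
set.seed(23)
X <- scale(iris[, 1:4]); y <- iris$Species

# local: clusters fit separately within each class
local.centers <- do.call(rbind, lapply(split(as.data.frame(X), y),
                         function(g) kmeans(as.matrix(g), centers = 3)$centers))

# global: the same total number of clusters, fit on all examples at once
global.centers <- kmeans(X, centers = nrow(local.centers), nstart = 10)$centers

nrow(local.centers)  # 3 classes x 3 clusters = 9
```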

##                    Dataset    Global     Local
## 1                     wine 0.9772727 0.9659091
## 2  breast-cancer-wisconsin 0.9677419 0.9706745
## 3                    yeast 0.6075778 0.6143437
## 4                    glass 0.6761905 0.6190476
## 5                    ecoli 0.8373494 0.8132530
## 6            vowel-context 0.8808081 0.8323232
## 7                     iris 0.9466667 0.8533333
## 8    pima-indians-diabetes 0.7708333 0.7760417
## 9                sonar.all 0.7864078 0.7961165
## 10      image-segmentation 0.9463203 0.9333333
## 11              ionosphere 0.9371429 0.8571429
## 12                  letter 0.9277494 0.9224457
## 13                   magic 0.8622503 0.8616193
## 14               optdigits 0.9768601 0.9708081
## 15               pendigits 0.9908975 0.9898052
## 16        statlog-satimage 0.8980100 0.8889925
## 17         statlog-segment 0.9515152 0.9532468
## 18         statlog-vehicle 0.7203791 0.7322275

The results of this experiment do not show a clear winner, although the global approach works better in more cases than the local one. Nevertheless, the Wilcoxon test was unable to find a significant difference between the two approaches at alpha=0.05, so we conclude that both are equally valid. Given the above, we lean towards the global approach, as it generally detects a smaller number of clusters and therefore generates fewer new features, which, in turn, helps generalization.

## 
##  Wilcoxon rank sum test
## 
## data:  result.display$Global and result.display$Local
## W = 180, p-value = 0.2921
## alternative hypothesis: true location shift is greater than 0

6.5 Supervised vs semi-supervised learning

Since the method discussed in this research creates new features regardless of the decision attribute, it is easy to use in a semi-supervised setting. Therefore, in this experiment we would like to check whether clustering both the training and the testing data produces better features.
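The difference between the two settings boils down to which rows the clustering sees; a minimal sketch:

```r
set.seed(23)
X   <- scale(iris[, 1:4])
idx <- sample(nrow(X), 75)                             # training half

km.sup  <- kmeans(X[idx, ], centers = 8, nstart = 10)  # training data only
km.semi <- kmeans(X,        centers = 8, nstart = 10)  # training + unlabeled test

# Labels are never used in either case; the supervised variant is simply
# blind to where the test examples lie in the feature space.
```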

##                    Dataset Semi.supervised Supervised
## 1                     wine       0.9431818  1.0000000
## 2  breast-cancer-wisconsin       0.9618768  0.9648094
## 3                    yeast       0.5764547  0.6143437
## 4                    glass       0.7142857  0.6761905
## 5                    ecoli       0.8493976  0.8493976
## 6            vowel-context       0.9191919  0.8666667
## 7                     iris       0.9200000  0.9466667
## 8    pima-indians-diabetes       0.7656250  0.7734375
## 9                sonar.all       0.8446602  0.8155340
## 10      image-segmentation       0.9376623  0.9255411
## 11              ionosphere       0.9314286  0.9257143
## 12                  letter       0.9316522  0.9224457
## 13                   magic       0.8640379  0.8608833
## 14               optdigits       0.9782841  0.9704521
## 15               pendigits       0.9912616  0.9899873
## 16            spectrometer       0.5421687  0.5381526
## 17        statlog-satimage       0.8958333  0.8905473
## 18         statlog-segment       0.9393939  0.9203463
## 19         statlog-vehicle       0.7180095  0.7085308
## Warning in wilcox.test.default(result.display$Semi.supervised,
## result.display$Supervised, : cannot compute exact p-value with ties
## 
##  Wilcoxon rank sum test with continuity correction
## 
## data:  result.display$Semi.supervised and result.display$Supervised
## W = 184.5, p-value = 0.4593
## alternative hypothesis: true location shift is greater than 0

6.6 Sensitivity test

In this experiment we will examine how the number of clusters influences the quality of classification. We will analyze only the new features and discard the original ones. In addition to test set accuracy, we will also report training set accuracy to check when the model starts overfitting due to the high dimensionality of the new feature space. We will vary the number of clusters from 1 to a ridiculous 200, just to observe exactly what impact this has on classification quality. Let us begin with linear SVM on the pima-indians-diabetes dataset.
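A compressed base-R sketch of such a scan (a toy two-class problem with logistic regression standing in for the SVM runs reported below; only the new distance features are used):

```r
set.seed(23)
X <- scale(iris[, 1:4])
y <- as.numeric(iris$Species == "versicolor")   # toy two-class target

train.acc <- sapply(c(2, 5, 10, 20), function(k) {
  km <- kmeans(X, centers = k, nstart = 5)
  D  <- t(apply(X, 1, function(p) sqrt(colSums((t(km$centers) - p)^2))))
  fit <- glm(y ~ D, family = binomial)          # new features only
  mean((fitted(fit) > 0.5) == y)
})
round(train.acc, 3)  # training accuracy tends to creep up with k
```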

As we can see, after reaching a test accuracy of approximately 78% with around 20 clusters, the test accuracy stops improving and starts diverging from the training accuracy around the k=35 mark, while the training accuracy keeps improving, which is a clear sign of overfitting. Since SVM is a reasonably robust algorithm, this effect is not as dramatic as one might expect, so to emphasize the issue let us use a simple logistic regression classifier to get a clearer indication of where the overfitting actually begins.

Now we can observe this effect even more clearly, with the first significant differences appearing around k=30 and the two lines clearly starting to diverge after the k=40 mark. Interestingly, affinity propagation picked 35 as the number of clusters for this dataset, which (judging by these plots) seems just about right!

Now let’s look at how this experiment turns out for other datasets with linear SVM.

## [1] "wine"
## Warning: Removed 1 rows containing missing values (geom_path).

## [1] "breast-cancer-wisconsin"
## Warning: Removed 1 rows containing missing values (geom_path).

## [1] "yeast"
## Warning: Removed 1 rows containing missing values (geom_path).

## [1] "glass"
## Warning: Removed 1 rows containing missing values (geom_path).

## [1] "ecoli"
## Warning: Removed 1 rows containing missing values (geom_path).

## [1] "vowel-context"

## [1] "iris"

## [1] "pima-indians-diabetes"

## [1] "sonar.all"
## Warning: Removed 1 rows containing missing values (geom_path).

## [1] "image-segmentation"

## [1] "ionosphere"
## Warning: Removed 1 rows containing missing values (geom_path).

## [1] "optdigits"
## Warning: Removed 1 rows containing missing values (geom_path).

## [1] "pendigits"
## Warning: Removed 1 rows containing missing values (geom_path).

## [1] "spectrometer"
## Warning: Removed 1 rows containing missing values (geom_path).

## [1] "statlog-satimage"
## Warning: Removed 1 rows containing missing values (geom_path).

## [1] "statlog-vehicle"
## Warning: Removed 1 rows containing missing values (geom_path).

6.7 Is there any difference between high and low quality features?

Since we have already established in the sensitivity experiment that the number of clusters has a clear influence on the quality of classification, let us now check whether the quality of the individual new features makes any difference. To do so, we will cluster the dataset into a certain number of clusters, encode the clusters as new features, and evaluate the quality of each new feature using the Fisher score. Next, we will add the new features one by one, in order of increasing and then decreasing quality, to observe the effect they have on classification accuracy. Again, we will use linear SVM and the pima-indians-diabetes dataset with k=35 (as determined by affinity propagation).
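The Fisher score itself is straightforward to compute; below is a base-R sketch of the usual multi-class form (between-class scatter of a feature divided by its within-class scatter), which is presumably the variant used here:

```r
fisher.score <- function(f, y) {
  n <- table(y)              # class sizes
  m <- tapply(f, y, mean)    # per-class means of the feature
  v <- tapply(f, y, var)     # per-class variances
  sum(n * (m - mean(f))^2) / sum(n * v)
}

set.seed(23)
y  <- rep(c(0, 1), each = 50)
f1 <- y + rnorm(100, sd = 0.1)   # feature that separates the classes well
f2 <- rnorm(100)                 # pure noise
fisher.score(f1, y) > fisher.score(f2, y)  # TRUE
```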

Ultimately, what we are doing in our approach is selecting points in n-dimensional space, calculating the distances between all data points and these new points, and encoding these distances as new features. This plot suggests (at least to some degree) that the choice of these points matters and has a high impact on the quality of classification. In the diagram, the blue line represents classification quality when adding new features in order of descending Fisher score, while the green line represents the same in ascending order. The lines obviously meet at the end, since in both cases all features are eventually used for classification. However, what happens before that is a clear indication that some points (clusters) hold more information than others.

6.8 Comparison between points selected by clustering and points uniformly distributed across the feature space

Since the previous experiment established that the positioning of the points in the feature space matters, in this experiment we will compare features generated by clustering with features generated from uniformly distributed points in the feature space.
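The uniform baseline can be generated by sampling each coordinate independently from the range it spans in the scaled data; a sketch:

```r
set.seed(23)
X <- scale(iris[, 1:4])   # stand-in for the scaled dataset
k <- 35                   # matching the cluster count found by affinity propagation

# one uniformly drawn reference point per row, inside the data's bounding box
uniform.points <- apply(X, 2, function(v) runif(k, min(v), max(v)))
dim(uniform.points)  # 35 x 4
```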

Now this is very interesting! Granted, this is only a single dataset with the total number of features picked based on affinity propagation clustering, but if this result holds for other datasets it will mean that clustering itself does not produce any additional useful information - it is the proposed distance-from-points-based feature transformation that causes the improvements in classification quality. Let’s test the same hypothesis on a larger dataset (optdigits).

Although the experiment was not repeated 10 times, as in the previous case, some characteristic is still visible and it seems to be reversed! We will have to check more datasets to verify the hypothesis about uniformly distributed points.